MALT: Distributed Data-Parallelism for Existing ML Applications

ABSTRACT

Systems and methods are disclosed for parallel machine learning with a cluster of N parallel machine network nodes by determining k network nodes as a subset of the N network nodes to update learning parameters, wherein k is selected to disseminate the updates across all nodes directly or indirectly and to optimize predetermined goals including freshness and a balanced communication and computation ratio in the cluster; sending learning unit updates to fewer nodes to reduce communication costs while preserving learning convergence; and sending reduced learning updates while ensuring that the nodes send/receive learning updates in a uniform fashion.

This application claims priority to Provisional Applications 62/061,284, filed Oct. 8, 2014, and 62/144,648, filed Apr. 8, 2015, the contents of which are incorporated by reference.

BACKGROUND

The present invention relates to machine learning.

A machine learning model or unit is a set of weights (or parameters) over features. Applying new input data to the machine learning model/unit gives a prediction output for a classification or regression problem. Distributed machine learning involves training a machine-learning unit in parallel. It uses a cluster of machines, each one training one or more unit replicas in parallel, with data being split across the replicas. Each replica trains on a subset of the data and incrementally updates the machine-learning unit every iteration. In order to ensure that each unit is created from all data, each of these replicas communicates the unit parameter values to one another. The replicas merge the incoming unit and continue training over local data.

A number of platforms have implemented distributed machine learning. For example, the Map-Reduce/Hadoop platform communicates unit updates using the file system. Hadoop uses the map-reduce paradigm. The map step consists of all replicas creating a trained unit. In the reduce step, the parallel replicas pick up the unit from the file system and apply it to their unit. Since Hadoop communicates using the file system, the training speed is limited by disk performance. Another platform is Spark, an in-memory data processing platform that stores objects in memory as immutable objects. The data is stored as distributed objects. Each worker trains on a set of data and updates the unit. Spark and Hadoop are based on the map-reduce paradigm and perform bulk-synchronous processing to create a machine learning unit. This is because both Hadoop and Spark are deterministic and have explicit training, update and merge steps. Synchronous unit training is slow, and with a large number of workers it can be too slow to be practical.

In a third paradigm, a dedicated parameter server collects all unit updates and sends out the unit updates to all network nodes. In these systems, all parallel units send unit updates to a single server and receive the updated unit. Hence, the parameter server receives the units, updates them to create a new unit and sends it to all replicas. While this system can train in an asynchronous fashion, it is not fully asynchronous, since it requires the workers to wait for an updated unit to arrive from the parameter server.

SUMMARY

In one aspect, systems and methods are disclosed for parallel machine learning with a cluster of N parallel machine network nodes by determining k network nodes as a subset of the N network nodes to update learning parameters, wherein k is selected to disseminate the updates across all nodes directly or indirectly and to optimize predetermined goals including freshness and a balanced communication and computation ratio in the cluster; sending learning unit updates to fewer nodes to reduce communication costs while preserving learning convergence; and sending reduced learning updates while ensuring that the nodes send/receive learning updates in a uniform fashion.

In another aspect, systems and methods are disclosed for providing distributed learning over a plurality of parallel machine network nodes by allocating a per-sender receive queue at every machine network node and performing distributed in-memory training; and training each unit replica while maintaining multiple copies of the unit replica being trained, wherein all unit replicas train, receive unit updates and merge in parallel in a peer-to-peer fashion, wherein each receiving machine network node merges updates at a later point in time without interruption, and wherein propagating and synchronizing unit replica updates are lockless and asynchronous.

Advantages of the system may include one or more of the following. The system can synchronize the units across the parallel replicas such that the synchronization overhead is low. The system's distributed in-memory training allows propagating and synchronizing unit updates to be lockless and completely asynchronous. By using extra space at every sender to provide asynchronous training, the system can reduce the unit training time. Reducing model or unit training times leads to better units being produced at shorter intervals. The reduced training times also help in parameter tuning, i.e., picking the right set of parameters to tune.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows exemplary MALT processing machines with data and unit/model parameters.

FIG. 2 shows an exemplary All-reduce exchange of unit updates.

FIG. 3 shows an exemplary Halton-sequence exchange of unit updates (N=6).
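
FIG. 4 shows an exemplary serial SGD method (Method 1) and a parallel SGD method written using MALT (Method 2).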

DESCRIPTION

FIG. 1 shows processing machines with data and model/unit parameters in one embodiment, called MALT. MALT provides a lockless in-memory learning solution. In MALT, instead of a single parameter server, all models or units train, receive unit updates and merge in parallel in a peer-to-peer fashion. MALT allocates a per-sender receive queue at every network node/process training a unit replica and maintains multiple copies of the machine learning unit being trained. The receiver merges these updates at a later point in time and is not interrupted. These machines train machine-learning units on data iteratively. Every iteration produces a unit update. These updates are propagated to all other machines that are training in parallel with portions of data from the same dataset. Each replica stores per-sender units (shown as V2, V3, . . . , Vn in V1; V1, . . . , Vn in V2).

When parallel replicas train their units and send their updates, the incoming units are stored in these receive queues on every network node. At the end of every iteration (or every few iterations), each unit processes the incoming units and updates its current unit. As a result, unit replicas can train in parallel and send their updates to one another, in a fully asynchronous fashion. The replicas send their updates and proceed to the next iteration. The receiver is not interrupted either, as it only looks up the queue at the end of its iteration cycle and performs a merge of all received updates.

This approach can be extended using hardware support (such as RDMA) to further reduce receive-side interruption. RDMA allows data to be sent directly to the receiver's memory without interrupting the receiver-side CPU. When unit updates are propagated using RDMA, it allows for fully asynchronous unit training. The receive side is not interrupted to merge its updates. The receive-side CPU is not even interrupted to receive the update and store it in a receive queue, since this operation is performed in hardware using RDMA.

Machine learning methods generalize from data. Machine learning methods train over data to create a unit representation that can predict outcomes (regression or classification) for new unseen data. More formally, given a training set {(x₁, y₁), (x₂, y₂), . . . , (xₙ, yₙ)}, the goal of unit training is to determine the function ƒ such that y=ƒ(x, w). The input x may consist of different features, and the unit consists of parameters w, representing the weights of individual features used to compute y. The goal of unit training is to estimate the values of the unit parameters w. During unit testing, this unit is tested using an unseen set of xₜ to compare against ground truth (already known yₜ), to determine the unit accuracy. Thus, machine learning methods train to minimize the loss, which represents some function that evaluates the difference between estimated and true values for the test data.
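
Written compactly, using the notation above, training solves the following (the averaged empirical-loss form below is one standard formulation, stated here for illustration):

    w* = argmin_w (1/n) Σ_{i=1..n} loss(ƒ(xᵢ, w), yᵢ)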

Learning model/unit training methods are iterative: the method starts with an initial guess of the unit parameters, learns incrementally over data, and refines the unit every iteration, converging to a final acceptable value of the unit parameters. Unit training can last from minutes to weeks and is often the most time-consuming aspect of the learning process. Unit training time also hurts the unit refinement process, since longer training times limit the number of times the unit configuration parameters (called hyper-parameters) can be tuned through re-execution.

Machine learning methods can benefit from scale-out computing platform support in multiple ways. First, these methods train on large amounts of data, which improves unit accuracy. Second, they can train large units that have hundreds of billions of parameters or require large computation, such as very large neural networks for large-scale image classification or genomic applications. Training with more data is done by data parallelism, which requires replicating the unit over different machines, with each unit training over a portion of the data. The replicas synchronize the unit parameters after a fixed number of iterations. Training large units requires the unit to be split across multiple machines, and is referred to as unit parallelism.

With datasets getting larger, there has been a recent focus on investigating online methods that can process datasets incrementally, such as the gradient descent family of methods. Gradient descent methods compute the gradient of a loss function over the entire set of training examples. This gradient is used to update the unit parameters to minimize the loss function. Stochastic Gradient Descent (SGD) is a variant of the above method that trains over one single example at a time. With each example, the parameter vector is updated until the loss function yields an acceptable (low) value. SGD and its variants are preferred methods for training over large datasets because they can process large training datasets in batches. Furthermore, gradient descent can be used for a wide range of methods such as regression, k-means, SVM, matrix factorization and neural networks.
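
For a single training example (xᵢ, yᵢ) and a learning rate η (η and the step index t are notation introduced here for illustration), the SGD step described above can be written as:

    w_{t+1} = w_t − η ∇_w loss(ƒ(xᵢ, w_t), yᵢ)

Each step moves the parameter vector against the gradient of the loss on one example, so the loss decreases in expectation over many steps.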

In data-parallel learning, unit replicas train over multiple machines. Each replica trains over a subset of the data. There are several ways in which the individual unit parameters can be synchronized. We describe three such methods. First, units may train independently and synchronize parameters when all parallel units finish training by exhausting their training data. These methods are commonly used to train over Hadoop, where communication costs are prohibitive. Also, while these units may train quickly because of limited communication between replicas, they may require more passes over the training data (each pass over the training data is called an epoch) for acceptable convergence. Furthermore, for non-convex problems, this method may not converge, since the parallel replicas may be trapped in different local minima, and averaging these diverging units may return a unit with low accuracy.

The second method is the parameter server approach. Here, individual units send their updates to a central parameter server (or a group of parameter servers) and receive an updated unit from it. A third method is the peer-to-peer approach (used in MALT), where unit replicas train in parallel and their parameters are mixed every (or every few) iteration. The last two methods achieve good convergence, even when the parameters are communicated asynchronously. With MALT, we perform asynchronous parameter mixing with multiple parallel instances of unit replicas. This design allows developers to write code once that runs everywhere on parallel replicas (no separate code for parameter server and client). This design also simplifies fault tolerance: a failed replica is removed from the parameter mixing step and its data is redistributed to other replicas. Finally, instead of performing simple gradient descent, MALT can be used to implement averaging of gradients from its peers, which speeds up convergence for certain workloads.

The system provides distributed machine learning over existing ML systems. MALT exposes an asynchronous parameter mixing API that can be integrated into existing ML applications to provide data-parallel learning. Furthermore, this API is general enough to incorporate different communication and representation choices as desired by the machine learning developer. MALT provides peer-to-peer learning by interleaving gradient (changes to parameters) updates with parameter values to limit network costs. In the next section, we describe the MALT design.

In FIG. 1's exemplary MALT architecture, existing applications run with modified gradient descent methods that receive unit updates (V) from replicas training on different data. Unit vectors are created using a Vector Object Library that allows creation of shared objects. Each replica scatters its unit update after every (or every few) iteration and gathers all received updates before the next iteration. Unit replicas train in parallel on different cores (or sets of cores) across different network nodes using existing ML libraries. ML libraries use the MALT vector library to create unit parameters or gradients (updates to parameters) that need to be synchronized across machines. These vectors communicate over DiSTributed One-sided Remote Memory, or dstorm. Furthermore, like other data-parallel frameworks, MALT loads data into unit replicas from a distributed file system such as NFS or HDFS. Developers use the MALT API to shard input data across replicas and send/receive gradients. Furthermore, developers can also specify the dataflow across replicas and make their methods fully asynchronous. We now describe the shared memory design, the MALT API that gives developers access to shared memory, fault tolerance, and the network communication mechanisms that allow developers to balance communication and computation.

Machine learning units train in parallel over sharded data and periodically share unit updates after a few iterations. The parallel replicas may do so synchronously (referred to as bulk-synchronous processing). However, this causes the training to proceed at the speed of the slowest machine in that iteration. Relaxing the synchronous requirement speeds up unit training but may affect the accuracy of the generated unit. Since unit weights are approximate, application developers and researchers pick a point in this trade-off space (accuracy vs. speed) depending on their application and system guarantees. Furthermore, this accuracy can be improved by training for multiple epochs or by increasing the amount of data for training each unit replica.

The original map-reduce design communicates results over GFS/HDFS. However, using disk for communication results in poor performance, especially for machine learning applications, which may communicate as often as every iteration. Spark provides immutable objects (RDDs) for an efficient in-memory representation across machines. Spark provides fault tolerance using the lineage of RDDs as they are transformed across operations. However, this enforces determinism in the order of operations. As a result, the immutability and determinism make it less suitable for fine-grained, asynchronous operations. Furthermore, machine learning applications may contain multiple updates to large sparse matrices or may need to propagate unit updates asynchronously across machines, and so need first-class support for fine-grained and asynchronous operations.

MALT's design provides efficient mechanisms to transmit unit updates. There has been a recent trend of wide availability of cheap and fast InfiniBand hardware, and it is being explored for applications beyond HPC environments. RDMA over InfiniBand allows low-latency networking on the order of 1-3 microseconds by using user-space networking libraries and by re-implementing a portion of the network stack in hardware. Furthermore, the RDMA protocol does not interrupt the remote host CPU while accessing remote memory. RDMA is also available over Ethernet with the newer RDMA over Converged Ethernet (RoCE) NICs, which have performance comparable to InfiniBand. InfiniBand NICs are priced competitively with 10G NICs, costing around $500 for 40 Gbps NICs and $800 for 56 Gbps NICs (as of mid-2014). Finally, writes are faster than reads since they incur lower round-trip times. MALT uses one-sided RDMA writes to propagate unit updates across replicas.

We build dstorm (DiSTributed One-sided Remote Memory) to facilitate efficient shared memory for ML workloads. In MALT, every machine can create shared memory abstractions called segments via a dstorm object. Each dstorm segment is created by supplying the object size and a directed dataflow graph. To facilitate one-sided writes, when a dstorm segment is created, the network nodes in the dataflow synchronously create dstorm segments. dstorm registers a portion of memory on every network node with the InfiniBand interface to facilitate one-sided RDMA operations. When a dstorm segment is transmitted by the sender, it appears at all its receivers (as described by the dataflow), without interrupting any of the receivers' CPUs. We call this operation scatter. Hence, a dstorm segment allocates space (a receive queue), in multiples of the object size, for every sender in every machine to facilitate the scatter operation. We use per-sender receive queues to avoid invoking the receiver CPU to resolve write-write conflicts arising from multiple incoming unit updates from different senders. Hence, our design uses extra space for the per-sender receive queues to facilitate lockless unit propagation using one-sided RDMA. Both of these mechanisms, one-sided RDMA and per-sender receive queues, ensure that the scatter operation does not invoke the receive-side CPUs.

Once the received objects arrive in the local per-sender receive queues, they can be read with a local gather operation. The gather function uses a user-defined function (UDF), such as an average, to collect the incoming updates. We also use queues on the sender side, allowing senders to perform writes asynchronously. Additionally, the sender-side queues maintain back-pressure in the network to avoid congestion.
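
The per-sender receive queue layout and the local gather with a UDF can be sketched in C++ as follows. This is a minimal single-process model for illustration only; all names are hypothetical, and in the actual dstorm the store into a queue slot is performed by the sender's NIC via one-sided RDMA rather than by local code:

    #include <cstddef>
    #include <vector>

    // Illustrative model of a dstorm-style segment: one fixed-depth queue
    // per sender, so concurrent writes from different senders never
    // collide (lockless by layout).
    struct Update {
      std::vector<float> grad;
    };

    class Segment {
      std::vector<std::vector<Update>> queues_;  // queues_[sender][slot]
      std::vector<std::size_t> next_;  // next slot to write, per sender
    public:
      Segment(std::size_t senders, std::size_t depth)
          : queues_(senders, std::vector<Update>(depth)), next_(senders, 0) {}

      // Receive side of scatter; by default the oldest slot is overwritten
      // when the queue is full, matching the semantics described below.
      void receive(std::size_t sender, const Update& u) {
        auto& q = queues_[sender];
        q[next_[sender] % q.size()] = u;
        ++next_[sender];
      }

      // gather: fold everything received so far with a user-defined function.
      template <typename UDF>
      Update gather(UDF combine) const {
        Update acc;
        for (const auto& q : queues_)
          for (const auto& u : q)
            if (!u.grad.empty()) combine(acc, u);
        return acc;
      }
    };

An averaging UDF would accumulate each incoming grad into acc and divide by the count at the end; a Hogwild-style UDF would simply replace acc with the latest update.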

The receiver does not know when its per-sender receive queues get filled unless it is actively polling and consuming these items. When a receive queue is full, the default behavior of dstorm is to overwrite previously sent items in the queue. We discuss the consistency behavior after we describe the vector abstraction used to create shared vectors or tensors (multi-dimensional vectors) over the dstorm object.

We build a vector object library (VOL) over dstorm that allows creating vector objects over shared memory. The goal of VOL is 1) to expose a vector abstraction instead of a shared memory abstraction (dstorm) and 2) to provide communication and representation optimizations. ML developers can specify gradients or parameters as a VOL vector (or tensor) and specify its representation (sparse or dense). They also specify a dataflow graph describing how the updates should be propagated in the cluster, which is used to create the underlying dstorm segment.

Hence, creating a vector in turn creates a dstorm segment that allows this vector to be propagated to all machines as described in the dataflow graph. This dataflow describes which machines may send updates to one another (in the simplest case, everyone may send their updates to everyone). Hence, an edge in the graph from network node A to network nodes B and C implies that when network node A pushes a unit update, it is received by network nodes B and C. As different machines compute unit updates, they scatter these updates to other remote network nodes without acquiring any locks or invoking any operations at the receiver. However, if a machine sends too many updates before the previous ones are consumed, the previous updates are overwritten.
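
As a small illustration of this dataflow (the types are hypothetical, not the MALT API), the graph can be held as an adjacency list in which the entry for a node lists the peers that receive its scatter:

    #include <cstddef>
    #include <vector>

    // dataflow[a] lists the network nodes that receive node a's unit updates.
    using Dataflow = std::vector<std::vector<std::size_t>>;

    // Edges from node 0 to nodes 1 and 2: when node 0 pushes a unit update,
    // nodes 1 and 2 receive it (and likewise for the other rows).
    Dataflow exampleGraph() {
      return {{1, 2}, {0, 2}, {0, 1}};  // fully connected, for 3 nodes
    }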

VOL inherits the scatter and gather calls from dstorm to send the vector to remote machines and to gather all the received updates (from local memory). Developers can also specify where to send the unit updates within scatter calls. This provides developers fine-grained access to the data flow, allowing greater flexibility. Table 1 describes the VOL API. In Section 4, we describe how this API can be used to easily convert serial ML methods to data-parallel ones.

Consistency Guarantees:

We now describe the consistency guarantees that MALT provides when transmitting unit updates to other replicas. With machine learning applications, which are stochastic in nature, unit updates may be overwritten or updated locklessly without significantly affecting the overall accuracy of the unit output. For example, Hogwild demonstrates that asynchronous, lockless unit updates lead to units that ultimately converge to acceptable accuracy. Hence, MALT need not provide strict consistency guarantees when sending unit updates over InfiniBand (e.g., as in key-value stores). However, since MALT is a general-purpose API, it provides mechanisms to deal with the following inconsistency issues:

1. Torn reads: When a unit replica sends a unit update to another unit replica, the sender may overwrite the unit update while the receiver is reading it, in the case where the replicas operate asynchronously and the receive queue is full. MALT provides an additional atomic gather which reads the shared memory in an atomic fashion.

2. Stale replicas: Unit updates carry iteration count information in the header. When a receiver realizes that a specific unit update is arriving too slowly, the receiver may stall its operations until the sender catches up (a minimal sketch of this check follows). This design is similar to the bounded-staleness approach explored by recent work.
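
A minimal sketch of that check, assuming each update header carries the sender's iteration count and a staleness bound maxLag (both the bound and all names are illustrative, not the MALT API):

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Stall the local replica when the slowest peer's most recent update is
    // more than maxLag iterations behind the local iteration count.
    bool shouldStall(const std::vector<std::uint64_t>& lastSeenIter,
                     std::uint64_t myIter, std::uint64_t maxLag) {
      if (lastSeenIter.empty()) return false;
      std::uint64_t slowest =
          *std::min_element(lastSeenIter.begin(), lastSeenIter.end());
      return myIter > slowest + maxLag;
    }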

If stricter guarantees are required, the unit replicas can train synchronously in bulk-synchronous fashion and use the barrier construct to do so. The barrier construct is a conventional barrier which waits for all unit replicas to arrive at a specific point in the training process.

MALT has a straightforward unit for fault tolerance. The training data is present on all machines in a distributed file system. The unit replicas train in parallel and perform one-sided writes to all peers in the communication graph. A fault monitor on every network node examines the return values of asynchronous writes to the sender-side queues. If the fault monitor observes failed writes, it performs a synchronous health check of the cluster with the monitors on other network nodes. A network node is considered dead if the network node is corrupt (the shared memory or the queue has failed) and the remote fault monitor reports this, or if the network node is unreachable by any of the other healthy network nodes' fault monitors. Furthermore, to detect failure cases that do not result in a machine or process crash, local fault monitors can detect processor exceptions such as divide-by-zero, stack corruption, invalid instructions and segmentation faults, and terminate the local training process.

In case of a failure, the working fault monitors create a group of survivor network nodes to ensure that all future group operations, such as barrier, skip the failed network nodes. The RDMA interface is re-registered (with the old memory descriptors) and the queues are rebuilt. This avoids a zombie situation where a dead network node may come back and attempt to write to one of the previously registered queues. Finally, the send and receive lists of all unit replicas are rebuilt to skip the failed network nodes, and the training is resumed. Since the send and receive lists are rebuilt, it is possible to re-run any MALT configuration on a smaller number of network nodes. If there is a network partition, training resumes on both clusters independently. However, it is possible to halt the training if the partition results in a cluster with very few network nodes.

After recovery, if an acceptable loss value is not achieved, the training continues on the survivor replicas with additional training examples until the units converge. This causes a slowdown in the training process proportional to the number of missing machines, apart from a short delay to synchronize and perform recovery (on the order of seconds). MALT only provides fail-stop fault tolerance, i.e., it can only handle failures where a fault monitor detects corruption or is unresponsive because the MALT process was killed, a machine failed or the network failed. MALT cannot handle Byzantine failures, such as a machine sending corrupt gradients, or software corruption of scalar values that cannot be detected by local fault monitors. MALT can afford a simple fault tolerance unit because it only provides data parallelism and does not split the unit across multiple machines. Furthermore, the unit training is stochastic and does not depend on whether the training examples are processed in a specific order, whether the training examples are processed more than once, or whether all the training examples have been processed, as long as the unit achieves an acceptable accuracy. Furthermore, MALT implements peer-to-peer learning and does not have a central master. As a result, it does not need complex protocols like Paxos to recover from master failures.

FIG. 2 shows an exemplary all-reduce exchange of unit updates. All arrows indicate bi-directional communication. As the number of network nodes (N) grows, the total number of updates transmitted increases as O(N²).

FIG. 3 shows an exemplary Halton-sequence exchange of unit updates (N=6). Each ith machine sends updates to log(N) network nodes (2 for N=6), i.e., to nodes N/2+i and N/4+i. As the number of network nodes N increases, the outbound network nodes follow the Halton sequence (N/2, N/4, 3N/4, N/8, 3N/8 . . . ). All arrows are uni-directional. As the number of network nodes (N) grows, the total number of updates transmitted increases as O(N log N).

MALT's flexible API can express different training configurations such as the parameter server, mini-batching and peer-to-peer parameter mixing.

When MALT is trained using the peer-to-peer approach, each machine can send its update to all the other machines to ensure that each unit receives the most recent updates. We refer to this configuration as MALT_(all). As the number of network nodes (N) increases, the gradient communication overhead in MALT_(all) increases as O(N²) in a naive all-reduce implementation. Efficient all-reduce primitives, such as the butterfly or tree-style all-reduce, reduce the communication cost by propagating the unit updates in a tree style. However, this increases the latency by a factor of the height of the tree. Furthermore, if the intermediate network nodes are affected by stragglers or failures, an efficient all-reduce makes recovery complex.

MALT provides an efficient mechanism to propagate unit updates, which we refer to as indirect propagation of unit updates. A developer may use the MALT API to send unit updates to either all N network nodes or to fewer network nodes k (1≤k≤N). MALT facilitates choosing a value k such that a MALT replica (i) disseminates the updates across all the network nodes eventually; and (ii) optimizes specific goals of the system such as freshness and a balanced communication/computation ratio in the cluster. By eventually, we mean that over a period of time all the network nodes receive unit updates from every other network node, directly or indirectly via an intermediate network node. However, when choosing a value k less than N, the developer needs to ensure that the communication graph of all network nodes is connected.

Hence, instead of performing an all-reduce, MALT limits the reduce operation to a subset of the connected network nodes. However, naively or randomly selecting which network nodes to send updates to may either leave certain network nodes without updates from specific network nodes (a partitioned graph of network nodes) or may propagate updates that are too stale (a weakly connected network node graph). This may adversely affect convergence in parallel learning units. We now describe how MALT can selectively distribute unit updates to ensure low communication costs and uniform dissemination of unit updates.

MALT provides a pre-existing data flow that sends fewer unit updates and ensures that all the units send/receive unit updates in a uniform fashion. To do so, every network node picks network nodes in a uniform fashion to ensure that the updates are distributed across all network nodes. For example, if every network node propagates its updates to k network nodes (k<N), we pick the k network node IDs based on a uniform random sequence, such as the Halton sequence, that generates successive points that create a k-network-node graph with good information dispersal properties. Each network node only sends updates to log(N) network nodes and maintains a log(N)-sized network node list. This network node list contains the network nodes to send updates to, generated using the Halton sequence. Hence, if we mark the individual network nodes in the training cluster as 1, . . . , N, network node 1 sends its updates to network nodes N/2, N/4, 3N/4, N/8, 3N/8, and so on (the Halton sequence with base 2). Hence, in this scheme, the total number of updates sent in every iteration is only O(N log N). This scheme ensures that the updates are sent uniformly across the range of network nodes. FIGS. 2 and 3 show the all-to-all and Halton communication schemes. In case of a failure, the failed network node is removed, and the send/receive lists are rebuilt.
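
A sketch of how a node's send list could be derived from the base-2 Halton sequence (1/2, 1/4, 3/4, 1/8, 3/8, . . . ); the function names are illustrative:

    #include <cstddef>
    #include <vector>

    // k-th element of the base-2 Halton sequence: 1/2, 1/4, 3/4, 1/8, 3/8, ...
    double halton2(std::size_t k) {
      double f = 1.0, r = 0.0;
      for (std::size_t i = k; i > 0; i /= 2) {
        f /= 2.0;
        r += f * static_cast<double>(i % 2);
      }
      return r;
    }

    // Send list for node `id` in a cluster of N nodes: about log2(N) peers
    // at offsets N/2, N/4, 3N/4, ... so updates disperse uniformly over IDs.
    std::vector<std::size_t> sendList(std::size_t id, std::size_t N,
                                      std::size_t fanout) {
      std::vector<std::size_t> targets;
      for (std::size_t k = 1; k <= fanout; ++k) {
        std::size_t offset = static_cast<std::size_t>(halton2(k) * N);
        targets.push_back((id + offset) % N);
      }
      return targets;
    }

For N=6 and a fanout of 2, node i gets offsets 3 and 1 (⌊N/2⌋ and ⌊N/4⌋), matching the N/2+i and N/4+i pattern in FIG. 3.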

Using MALT's network-efficient parallel unit training results in faster unit training times. This happens because 1) the amount of data transmitted is reduced; 2) the amount of time to compute the average of gradients is reduced, since gradients are received from fewer network nodes; and 3) in a synchronized implementation, this design reduces the number of incoming updates that each network node needs to wait for before going on to the next iteration. Furthermore, our solution reduces the need for high-bandwidth interfaces, reducing costs and freeing up the network for other applications.

Instead of having each network node communicate with every other network node, developers can program MALT to communicate with a higher (or lower) number of network nodes. The key idea is to balance the communication (sending updates) with computation (computing gradients, applying received gradients). Hence, MALT accepts a data flow graph as an input while creating vectors for the unit parameters. However, the graph of network nodes needs to be connected; otherwise, the individual unit updates from a network node may not propagate to the remaining network nodes, and the units may diverge significantly from one another.

TABLE 1. MALT interface. g.scatter( ) performs one-sided RDMA writes of gradient g to other machines. g.gather( ), a local operation, applies an average to the received gradients. g.barrier( ) makes the method synchronous.

MALT API call: Purpose of the call

g = createVector(Type): Creates a globally accessible shared unit parameter or gradient (unit update) vector. Type signifies sparse or dense.

g.scatter(Dataflow Graph, optional): Sends the unit (or just unit updates) to machines as described in the graph (the default sends to all machines).

g.gather(func): Applies a user-defined function func (like average) over unit updates that have arrived (locally) and returns a result.

g.barrier( ): Distributed barrier operation to force synchronization.

load_data(f): Shards and loads data from HDFS/NFS from file f.

The goal of MALT is to provide data-parallelism to any existing machine learning software or method. Given the MALT library and a list of machines, developers launch multiple replicas of their existing software that perform data-parallel learning.

MALT exposes an API as shown in Table 1. This API can be used to create (and port existing) ML applications for data-parallelism. To do so, the developer creates a parameter or a gradient object using the MALT API. The dense object is stored as a float array, and the sparse object is stored as key-value pairs.
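
For illustration only (these are hypothetical types, not the actual MALT definitions), the two representations could look like:

    #include <unordered_map>
    #include <vector>

    using DenseGrad  = std::vector<float>;               // dense: a float array
    using SparseGrad = std::unordered_map<long, float>;  // sparse: key-value
                                                         // (feature id -> value)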

FIG. 4 shows a serial SGD method (Method 1) and a parallel SGD method written using MALT (Method 2). In the serial method, the training method goes over the entire data, and for each training sample it calculates the associated gradient value. It then updates the unit parameters based on this gradient value.

In order to perform this training in a data-parallel fashion, this method can be rewritten using the MALT API (as shown in Method 2). The programmer specifies the representation (sparse vs. dense) and the data flow: ALL, which means that all machines communicate unit updates to one another; HALTON, which is the network-efficient data flow from the previous section; or an arbitrary developer-specified graph that represents the data flow. When a job is launched using MALT, it runs this code on each machine. Each machine creates a gradient vector object using the MALT API with the required representation properties (sparse vs. dense), creates communication queues with other machines based on the specified data flow, and creates receive queues for incoming gradients.

The pseudo-code for data-parallel machine learning using MALT is discussed next. The serial code of Method 1 is converted to data-parallel using MALT. All machines run the above code (in Method 2). Instead of average, the user may specify a function to combine incoming gradients/parameters. Optionally, g.barrier( ) may be used to run the method in a synchronous fashion.

Procedure 1 Serial SGD

    procedure SERIALSGD
      Gradient g;
      Parameter w;
      for epoch = 1 : maxEpochs do
        for i = 1 : maxData do
          g = cal_gradient(data[i]);
          w = w + g;
      return w

Procedure 2 Data-Parallel SGD with MALT

    procedure PARALLELSGD
      maltGradient g(SPARSE, ALL);
      Parameter w;
      for epoch = 1 : maxEpochs do
        for i = 1 : maxData/totalMachines do
          g = cal_gradient(data[i]);
          g.scatter(ALL);
          g.gather(AVG);
          w = w + g;
      return w

Each machine trains over a subset of the training data and computes the gradient value for each example. After training over each example (or batch of examples), this gradient value is sent using the one-sided RDMA operation. The method then computes an average of the received gradients using the gather function. Instead of an average, one can specify a user-defined function (UDF) to compute the resulting gradient from all incoming gradients. This is useful for methods where simple averaging may not work; for example, SVM may require an additional re-scaling function apart from performing an average over the incoming parameters. The training finishes when all machines in the cluster finish training over their local examples. The final parameter value w is identical across all machines in the synchronous all-to-all case. In other cases, w may differ slightly across machines but is within an acceptable loss value. In such cases, the parameters from any machine may be used as the final unit, or an additional reduce can be performed to obtain the final parameter values.
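
As one example of such a UDF, averaging followed by a re-scaling step of the kind an SVM formulation might need could be sketched as follows (the scale factor and names are assumptions for illustration):

    #include <cstddef>
    #include <vector>

    // Combine incoming gradients: element-wise average, then re-scale.
    std::vector<float> avgThenRescale(
        const std::vector<std::vector<float>>& incoming, float scale) {
      std::vector<float> acc;
      if (incoming.empty()) return acc;
      acc.assign(incoming[0].size(), 0.0f);
      for (const auto& g : incoming)
        for (std::size_t j = 0; j < g.size(); ++j) acc[j] += g[j];
      for (auto& v : acc)
        v = (v / static_cast<float>(incoming.size())) * scale;
      return acc;
    }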

For more complex methods, such as neural networks, which require synchronizing parameters at every layer of the neural network, each layer of parameters is represented using a separate maltGradient and can have its own data flow, representation and synchronous/asynchronous behavior.

Finally, it may be difficult to use the maltGradient allocation for certain legacy applications that use their own data structures for parameters or gradients. For such opaque representations, where MALT cannot perform optimizations such as exploiting sparseness, developers directly use dstorm. dstorm provides low-level shared memory access with scatter and gather operations, and allows managing the data flow and controlling the synchronization. Furthermore, the opaque data structures need to provide serialization/de-serialization methods to copy in/out of dstorm. Developers can also implement unit-parallelism by carefully sharding their unit parameters over multiple dstorm objects.
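
A hypothetical shape for those serialization hooks (illustrative, not the dstorm interface):

    #include <cstddef>

    // Opaque parameter/gradient types implement these hooks so the runtime
    // can copy them in and out of registered shared memory.
    struct Serializable {
      virtual std::size_t numBytes() const = 0;
      virtual void serialize(char* dst) const = 0;
      virtual void deserialize(const char* src, std::size_t n) = 0;
      virtual ~Serializable() = default;
    };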

We use the MALT API to make the following methods data-parallel. Currently, MALT allows programmers to extend or write programs in C++ and Lua.

We explore distributed stochastic gradient descent methods over linear and convex problems using Support Vector Machines (SVM). We use Leon Bottou's SVM-SGD. Each machine calculates the partial gradient and sends it to other machines. Each machine averages the received gradients and updates its unit weight vector (w) locally.

Matrix factorization involves decomposing a large matrix into two smaller matrices. This is useful for data composed of two sets of objects whose interactions need to be quantified. As an example, movie rating data contains interactions between users and movies. By understanding these interactions and calculating the underlying features for every user, one can determine how a user may rate an unseen movie. To scale better, large-scale matrix factorization is not exact, and methods approximate the factorizations. SGD gives good performance for matrix factorization on a single machine, and we perform matrix factorization using SGD across multiple machines. We implement Hogwild and extend it from a multi-core implementation to a multi-network-node implementation using MALT. With Hogwild, the gather function is a replace operation that overwrites parameters.

We train neural networks for text learning. The computation in a neural network occurs over multiple layers forming a network. The training happens in forward and backward passes. In the forward pass, the input samples are processed at each layer and fed forward into the network, finally returning a predicted result at the end of the network. The difference between the ground truth and this predicted result is used in the back-propagation phase to update unit weights using the gradient descent method. Parallel training over neural networks is more difficult than SVM for two reasons. First, a data-parallel neural network requires synchronizing parameters for every layer. Second, finding the unit weights for neural networks is a non-convex problem. Hence, just sending the gradients is not sufficient, as the parallel unit replicas may be stuck in different local minima; gradient synchronization therefore needs to be interleaved with whole-unit synchronization. We use RAPID and extend its neural-network library with MALT. RAPID is similar in architecture to Torch, and provides a C++ library with a Lua front-end for scripting. MALT exports its calls with Lua bindings and integrates with RAPID.

MALT is implemented as a library, and is provided as a package to SVM-SGD, Hogwild and RAPID, allowing developers to use and extend MALT. dstorm is implemented over GASPI, which allows programming shared memory over InfiniBand. GASPI exposes shared memory segments and supports one-sided RDMA operations. dstorm implements object creation, scatter, gather and other operations. dstorm hides all GASPI memory management from the user and provides APIs for object creation, scatter/gather and dataflow. GASPI is similar to MPI, and MALT could be implemented over MPI. However, GASPI has superior performance to certain MPI implementations.

We implement the vector object library over dstorm; it provides vector abstractions and other APIs for loading data and for sparse and dense representations. Overall, the MALT library is only 2,366 LOC. To integrate with Lua, we have written Lua bindings (in Lua and C++) consisting of 1,722 LOC.

In summary, existing map-reduce frameworks are optimized for batch processing and are ill-suited for tasks that are iterative, fine-grained and asynchronous. Recent scalable ML platforms force developers to learn a new programming environment and rewrite their ML software. The goal of MALT is to efficiently provide data-parallelism to existing ML software. Given a list of machines and the MALT library, we demonstrate that one can program ML methods and control the data flow and synchrony. We provide the MALT library interface for procedural (C++) and scripting (Lua) languages and demonstrate data-parallel benefits with SVM, matrix factorization, and neural networks. MALT uses one-sided RDMA primitives that reduce network processing costs and transmission overhead. The new generation of RDMA protocols provides additional opportunities for optimization. Primitives such as fetch_and_add can be used to perform gradient averaging in hardware and further decrease unit training costs in software.

Although the techniques have been described in language specific to structural features and/or methodological acts, it is to be understood that the appended claims are not necessarily limited to the features or acts described. Rather, the features and acts are described as example implementations of such techniques.

Unless otherwise noted, all of the methods and processes described above may be embodied in whole or in part by software code modules executed by one or more general purpose computers or processors. The code modules may be stored in any type of computer-readable storage medium or other computer storage device. Some or all of the methods may alternatively be implemented in whole or in part by specialized computer hardware, such as FPGAs, ASICs, etc.

Conditional language such as, among others, “can,” “could,” “might” or “may,” unless specifically stated otherwise, is used to indicate that certain embodiments include, while other embodiments do not include, the noted features, elements and/or steps. Thus, unless otherwise stated, such conditional language is not intended to imply that features, elements and/or steps are in any way required for one or more embodiments, or that one or more embodiments necessarily include logic for deciding, with or without user input or prompting, whether these features, elements and/or steps are included or are to be performed in any particular embodiment.

Conjunctive language such as the phrase “at least one of X, Y or Z,” unless specifically stated otherwise, is to be understood to present that an item, term, etc. may be either X, or Y, or Z, or a combination thereof.

Many variations and modifications may be made to the above-described embodiments, the elements of which are to be understood as being among other acceptable examples. All such modifications and variations are intended to be included herein within the scope of this disclosure.

What is claimed is:
1. A method for machine learning, comprising: with a cluster of N parallel machine network nodes, determining k network nodes as a subset of the N network nodes to update learning parameters, wherein k is selected to disseminate the updates across all nodes directly or indirectly and to optimize predetermined goals including freshness, balanced communication and computation ratio in the cluster; sending learning unit updates to fewer nodes to reduce communication costs with learning convergence; and sending reduced learning updates and ensuring that the nodes send/receive learning updates in a uniform fashion.
2. The method of claim 1, comprising training the parallel network nodes in a peer-peer fashion with per-sender receive queues.

3. The method of claim 1, comprising performing uniform dissemination of network node updates with Halton series.

4. The method of claim 3, wherein each learning node is selected in a uniform fashion to ensure that updates are distributed across all nodes.

5. The method of claim 3, wherein each learning node propagates updates to k nodes, comprising picking k node IDs based on a uniform random sequence that generates successive points that uniformly cover an ID range.

6. The method of claim 5, wherein the sequence comprises a Halton sequence.

7. The method of claim 6, comprising maintaining at each network node a list of log(N) nodes, wherein the node list contains nodes in a Halton series to send updates to.

8. The method of claim 1, comprising marking each network node as 1 . . . N, wherein node 1 sends updates to a Halton series including N/2, N/4, 3N/4, N/8, 3N/8, 5N/8.

9. The method of claim 1, comprising sending total updates in every iteration with a complexity of O(N log(N)).
10. The method of claim 1, comprising ensuring updates are sent uniformly across a range of nodes using any uniform random sequence other than Halton.

11. The method of claim 1, comprising: picking a value for a number of outbound communication nodes; and balancing computation with communication in a parallel learning node.

12. The method of claim 1, comprising balancing communication and computation costs in the cluster.

13. The method of claim 12, comprising overlapping communication with computation in the cluster for maximum concurrency.

14. The method of claim 1, comprising choosing ‘k’ nodes such that a time to communicate k updates is equal to a time to compute model updates for a next iteration.
15. A machine learning system, comprising: a plurality of parallel networked nodes each including a processor; a network card coupling the processor to the other learning machine nodes; and computer readable code executable by the processor for: determining k nodes as a subset of nodes to update model updates with 1<=k<=N, wherein k is chosen to disseminate the updates across all nodes directly or indirectly and to optimize predetermined goals of the system including freshness, balanced communication and computation ratio in the cluster; sending model updates to fewer nodes such that communication costs are reduced but the overall model still converges; and sending reduced model updates and ensuring that models send/receive model updates in a uniform fashion.
16. The system of claim 15, wherein the parallel network nodes communicate in a peer-peer fashion with per-sender receive queues.

17. The system of claim 15, wherein the nodes perform uniform dissemination of network node updates with Halton series.

18. The system of claim 17, wherein each learning node is selected in a uniform fashion to ensure that updates are distributed across all nodes.

19. The system of claim 17, wherein each learning node propagates updates to k nodes, comprising picking k node IDs based on a uniform random sequence that generates successive points that uniformly cover an ID range, and wherein each network node is marked as 1 . . . N, and node 1 sends updates to a Halton series including N/2, N/4, 3N/4, N/8, 3N/8, 5N/8.

20. The system of claim 15, comprising a vector library to create unit parameters or gradients (updates to parameters) to be synchronized across network nodes, and the vectors communicate over a DiSTributed One-sided Remote Memory (dstorm).